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Abstract 

In the context of non overlapping domain decomposition methods, several algebraic approximations of the 
Dirichlet-to-Neumann (DtN) map are proposed in [F. X. Roux, et. al. Algebraic approximation of Dirichlet- 
to-Neumann maps for the equations of linear elasticity, Comput. Methods Appl. Mech. Engrg., 195, 2006, 
3742-3759]. For the case of non overlapping domains, approximation to the DtN are analogous to the 
approximation of the Schur complements in the incomplete multilevel block factorization. 

In this work, several original and purely algebraic (based on graph of the matrix) domain decomposition 
techniques are investigated for steady state incompressible Navier-Stokes equation defined on uniform and 
stretched grid for low viscosity. Moreover, the methods proposed are highly parallel during both setup and 
application phase. Spectral and numerical analysis of the methods are also presented. 



1 Introduction 

At the core of some numerical simulations lies the problem of solving sparse linear systems of the form 

Cx = b, (1) 

where C e M^^^, x e M^, b e M^. One of the sources of equation |l]) is the following time evolving 
Navier Stokes equation 

in rj, (2) 

in r2, (3) 
on F, (4) 

where v > Q is the kinematic viscosity coefficient (inversely proportional to Reynolds number Re), A is the 
Laplace operator in W'-, V denotes the gradient, V- stands for divergence, and B is a boundary operator. 
The domain Vt C W''{d = 2, 3) is the bounded, connected domain with a piecewise smooth boundary F. Here 
/ : n — )• M'^, the boundary data given hy g : F — > M'^. The system models the flow of an incompressible 
Newtonian fluid such as air or water. The presence of non-linear term u ■ Vu indicates presence of multiple 
solutions. We are interested in finding the velocity field u : — >■ M'' and a pressure field p : — >■ E that 
satisfies the equations ([2]), ([3|, and Q above. Implicit time discretization such as Crank-Nikelson scheme 
together with spatial discretization such as finite element scheme [TT] of the Navier-Stokes system above 
leads to a linear system ([I]) where the matrix C has the following block 2x2 partitioned form 

^This work was completed in part time when the author was provided office facilities and access to journals by Institut Henri 
Poincare, UMS 839 (CNRS/UPMC) while taking part in trimester program on control and PDE (Oct. - Dec, 2010), and Fonds 
de la recherche scientifique (FNRS)(Rof; 2011/V 6/5/004-IB/CS-15), Belgique. 



vAu + (u ■ V)u + Vp = f 

at 

V • ?i = 

B?i = g 
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c = 



D 




-E 






Here E'^ is the discrete gradient and —E is the negative divergence operator. The structure of the matrix D 
depends on the nonhnear algorithm; for Picard iteration D is block diagonal with each diagonal block being 
the discrete convection-diffusion operator. For Newton iteration, D has more complex structure The 
matrix C is indefinite and non-symmetric. 

In a real life simulation, the coefficient matrix C is usually large and sparse, and the usual direct methods 
[lOj are costly both in terms of CPU time and storage requirements. A common preference is to use 
Krylov subspace methods with some suitable preconditioning. A preconditioner B is an approximation to 
the coefficient matrix C such that the preconditioned operator B"-'-C has "favourable" spectrum that is 
essential for a fast convergence of Krylov subspace based iterative methods ^31, . In general, the number of 
iterations required for convergence is less when the eigenvalues are clustered towards one and they are away 
from zero. On the other hand, the time required to setup the preconditioner, and the cost of applying it 
during the iteration phase should not be too demanding. 

Most of the classical and recent preconditioners for the Navier-Stokes systems are approximate block fac- 
torization (ABF). Classical pressure correction methods are SIMPLE (Semi-Implicit Method for Pressure 
Linked Equations), and their modifications SIMPLEC, and SIMPLER gUMlllSKig. A promising class of 
method based on approximation of the Schur complement {S = ED~^E^) is a pressure convection diffusion 
(PCD) preconditioner, where, the Schur complement is approximated by 

S = EE'^DpK (5) 

Thus, the approximation S is obtained by first approximating D by Dp and then commutating D~^E^ 
to get Dp on the right. Here Dp is the discrete convection-diffusion operator projected on the pressure 
space [TT]. The method leads to convergence rates that are independent of the mesh size but deteriorates 
with Reynolds number higher that 100 as confirmed in the numerical experiments section [T31[I11[IH]- A 
modification of PCD is the least squares commutator (LSC) preconditioner, where the construction of the 
discrete convection-diffusion operator projected on the pressure space is automated by solving the normal 
equations associated with a least square problem derived from the commutator in |12j . Another approach is 
based on the Hermitian or Skew Hermitian Splitting (HSS) and Dimensional Splitting (DS) of the problem 
along the components of the velocity field and its relaxed version are introduced in Here, HSS has not 

been implemented efliciently for Oseen problems and DS preconditioning suffers poor convergence on low 
viscosity problems on stretched grid. In general, these methods belong to class of preconditioners where the 
preconditioner has block 2x2 form, and for parallelism, they rely on the scalability of the inner solvers for 
(1,1) and Schur complement block in the preconditioner. 

In this paper, we concern ourselves with substructured domain decomposition (DD) based preconditioner. 
On contrary to overlapping DD methods, the non-overlapping DD methods tries to approximate the so- 
called Dirichlet to Newmann (DtN) map [HI [501 [U [12 IS] . For its algebraic counterpart, approximation 
of DtN map is related to the approximation of the Schur complement [22l [28l [29] . In the non-overlapping 
DD method considered in this paper, the required substructuring (partitioning) is obtained by a graph 
partitioner |171 132j thus leading to a purely algebraic method where the graph of the matrix is sufScient and 
nothing else is assumed of the computational domain and boundary conditions. The partitioner finds a set 
of nodes (separator), removal of which leaves the graph disconnected into as many disconnected components 
as required. That is, the matrix C above is transformed to P-^CP where P is the permutation matrix 
that resuffies the rows and columns. Thus, the resulting permuted matrix can now be partitioned into 
block 2x2 form such that the (1,1) block is block diagonal with as many blocks as desired. We notice 
here that such reordering techniques are popular and almost always taken into consideration when degisning 
direct [TU] or hybrid methods [311 [I] to reduce the amount of fill-in and to enhance parallelism [31] . The 
main contribution of our work is the degisn of a parallel computation of Schur complement. In [28) . some 
algebraic approximations are considered for the Schur complement; in one of the methods the global Schur 
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complement is approximated in patches. The computation of these patch Schur complements ultimately 

leads to an approximation of the global Schur complement. In this paper, wc propose some modifications 
in the approximation of the Schur complements. Rather than building a patch around a node, we consider 
an aggregated set of nodes, and build a mini Schur complement approximation (MSG) for all the nodes 
of the aggregate at once. Apart from this basic modification, wc propose to construct "patches" based on 
the numbering of the nodes rather than on the edge connections. In other words, patches consists of closely 
numbered nodes rather than closely connected ones. This new approach leads to much faster approximations. 
The method is purely algebraic and takes matrix and right hand side as an input, and easily integrated in 
a non-linear solver. Compared to two state-of-the-art ABF methods namely PCD and LSC, the proposed 
methods are attractive for several reasons: 

• The setup and application phase of the preconditioners are massively parallel 

• The new methods converges faster and in particular, compared to PCD and LSC methods, they perform 
significantly better for Reynolds number larger than 100. 

• The diagonal blocks of the (1,1) blocks are approximated by incomplete LU factorizations leading to 
sparse factors for the preconditioner. 

Although, the method can be tried on any problem, wc concern ourselves with problems steming from 
Navier-Stokes equation for Reynolds number ranging from 10 to 3000. 

The remainder of this paper is organized as follows. In section (2), we explain briefly the PCD and LSC 
methods. In section (3), we introduce the new methods based on mini Schur complements, the importance 
of overlapping the patches will be studied. In section (4), we discuss the parallelism and implementation 
aspects for the new methods. Finally, in section (5), we present the numerical experiments and we compare, 
the fill factor, the iteration count, and the robustness of the methods. 

2 Some preconditioners for the incompressible Navier-Stokes equa- 
tions 

In this section, we briefly describe the pressure convection diffusion (PCD) and least squares commutator 
(LSC) preconditioners. These methods will serve as a benchmark methods for comparison. 

2.1 Pressure Convection Diffusion 

Let the discrete convection diffusion operator on the velocity space be defined as follows 



where Wh is the approximation to the discrete velocity in the recent Picard iteration. Consider a similar 
operator is defined on the pressure space 



(6) 



ifp = {-vd'' + Wh ■ V), 



(7) 



Let e denote the commutator of both these operator with the gradient operator as follows 



p 



(8) 



The discrete analog of the commutator is given as follows 



(9) 
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Here Q and Qp are velocity and pressure mass respectively. The transformation from integrated to nodal 
values is done by the inverse operations of velocity and pressure mass matices. Here Dp is the discrete 
convection-diffusion operator on pressure space. Following approximation to the Schur complement 

ED-^E'^ « EQ-^E'^D-^pQp (10) 

is obtained by assuming that {ED^^Q)e{Fp^^Qp) is close to zero. We observe that E being discrete di- 
vergence oeprator and E'^ being the gradient operator, EE^ is the discrete Laplacian and EQ^^E'^ is the 
scaled discrete Laplacian. Here EQ^^E^ being expensive is replace by its spectrally equivalent pressure 
mass matrix Ap, thus leading to a final approximation S of the Schur complement to be 

S^~ApD-^Qp (11) 
The PCD preconditioner denoted by BpcD is defined as follows 



D 


E^ - 





S 



2.2 Least Square Commutator 



One of the drawbacks of the PCD method is that we need to construct the convection-diffusion operator Dp 
projected on the pressure space which essentially requires an full understanding of the underlying discretiza- 
tion scheme and other impelmentation details. In [131 112| . Elman et. al. propose to find Dp automatically 
by solving a least square problem of the form min||e?i||Q, i.e., by minimizing 



rmn\\{Q-'DQ-'E^)j - E^Q-\Dp)j\\Q, 



(12) 



where || • Hq is x^Qx norm, and {K)j for any matrix K denotes the j*'' column of K. The normal equations 
associated with this problem are given as follows 

Qp^EQ-^E^Q-\Dp), - iQpEQ-^DQ-'E^), (13) 

which gives 

Dp = Qp{EQ-^E^y\EQ-^DQ-^E^). 
Substuting the expression for Dp in (10 1, we obtain an approximation to the Schur complement as follows 

s = {eq-^e^){eq-^dq-^e'^)-\eq-^e'^). 



ed-'^e'^ 



Solving with this approximation requires two discrete poisson solve (scaled Laplacian) which can be handled 
efficiently by multigrid methods [30l |34j [35l |8] , and in contrast to other Schur complement based methods, 
we only need a matrix vector product with D. The LSC preconditioner denoted by Blsc is defined as 
follows 



B 



LSC — 



D 


et ■ 





s 



3 Mini Schur complements 
3.1 Graphical view 

In this section, we introduce an aggregation based mini Schur complement. Consider again the following 
block 2x2 partitioned matrix 
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c = 



" D 


E ' 


F 


G 



Let S = G — FD~^E denote the global Schur complement. Furthermore, let @, S', ^ , and denote the 
set of vertices corresponding to the adjacency graph of matrices D, E, F, and G respectively. Also, assume 
a local numbering of nodes in ^ and ^, and for simplicity, we assume that the number of nodes in ^ is 
greater than the number of nodes in ^. The MSCs are constructed in the following steps. 



1. Choose a set of aggregated nodes in graph ^ 'f = {^p^,^p^, . . . ,5^^^}, '^p^ C i^, $fp. fl ^p. = for 
* 7^ J ) I H Pi ^^'^ ^v^vi — possible choice of aggregation is simply choosing the nodes with 
consecutive numbering as follows 

r = {{1, 2, . . . {vx + + 2, . . . ,pi +P2}, . . . }. (14) 

Note here that the aggregation y is done based on the "numbering" of the nodes in the grid, rather 
than on the "closeness" of the nodes determined by looking at the edge connectivity between the nodes. 
For example, in the case of 2D n x n grid, the node numbered i could be at a distance (or path length) 
i + n from node numbered i + 1. 

Remark 1 14^6 notice here that a simple generalization of the above aggregation schem.e is obtained 
when we allow overlapping between the aggregates, i.e., for overlapped aggregation scheme we con- 
sider r = %^ C 'S, forii^ j, I 1= Pi and = ^. As 
in the case of overlapped Schwarz methods, overlapping increases sharing of inform,ation between the 
MSCs thus leading to an improved approximation of the global Schur complement. This overlapped 
aggregation scheme could be applied to all the methods that follow thus leading to an improvement in 
the approximation of the method concerned. However, it is to be noted that overlapping leads to lack of 
parallelism during the solve phase since the global approximated Schur complement is no longer block 
diagonal. 

2. Next, we have following three possible approximation schemes 

(a) Mini Schur complements based on edge connectivity (MSCE): For each aggregated nodes 
'S^p. , identify a set of nodes in the set ^ such that for each node of Sfp . , there exist a node 
in within a path length of r^. That is, we can reach a node in $#p^ from at least one node 
in by traversing a path of length less than or equal to fj in the adjacency graph of matrix 
C. The edges between the two aggregates and '^p. are denoted by Ei (incoming edge) and 
Fi (outgoing edge). When this method is defined for the overlapping aggregates, we shall call it 
OMSCE (overlapped mini Schur complement based on edge connectivity). 

(b) Mini Schur complements based on numbering (MSCN): For each aggregated nodes i^p., 
identify a set of nodes in the graph ^ by setting ^p. = ^p. . Remember that the graphs ^p. and 

'^p^ have local numbering of the nodes. Thus, we identify aggregates which have same numbering 
in their respective graph. When this method is defined for the overlapping aggregates, we shall 
call it OMSCN (overlapped mini Schur complement based on numbering). 

(c) Lumped approximation of Schur complement (Lump): Do not do anything for Lump 
method. When this method is defined for the overlapping aggregates, we shall call it OLUM 

(overlapped Lumped method.) 

3. Here again wc have three cases as follows 

(a) Computation of mini Schur complements for MSCE: Assemble the nodes of the aggregate 
't^p. and the edge connections between the nodes in the matrix Gp. . Similarly, Assemble the nodes 
of aggregates and the edge connections between the nodes in the matrix D^.. For a non- 
symmetric matrix C, the corresponding graph is considered as a directed graph where the entries 
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below the diagonal of the matrix may represent the incoming edge and those above the diagonal 
are outgoing edges. Let Fi be the matrix which stores the incoming edges and Ei denotes the 
outgoing edges. We consider the following sub matrix for the ith aggregate 





' F>r, 


1 









The corresponding mini Schur complement is given as follows 

Si = Gp. — Fi{Dri) ^Ei 

Here Si is called the mini Schur complement. In case, the matrix is large we may approximate 
Si by 

S^^Gp^-F,{Dr,)-^Ea (15) 

where 1 = [1,1,1,..., 1]"^. We call this method MSCER, where R stands the rowsum. 

(b) For MSCN: 

We have shown the computation of Schur complement for MSCE but we can proceed in a similar 
way to compute the mini Schur complement for MSCN method by replacing D^- by Dp. above. In 
the sections that follows, we shall give a matrix view of the MSCN method. A simple generalization 
of MSCN method is to take the slightly larger and not necessarily the same size as G^ ■ A 
colsum approximation of the Schur complement as done above for MSCE method will be called 
MSCNR. 

(c) For Lump: The corresponding i*'^ mini Schur complement for the lumped approximation is 
given as follows 

Si = Gp; 

4. Extract all the columns corresponding to the nodes of the aggregate Si corresponding to the nodes in 
the aggregate 'i^p. and put them in the corresponding columns in the global Schur complement. This 
step is same for MSCN and Lump method. In other words, set S = blkDiag(S'i). 



3.2 Illustration of MSCN method with an example 

In this section, we illustrate the MSCN method graphically for an small example, but for implementation, 
we refer the reader to Algorithms ([!]) and 

In Figure ([T]), the two sub-figures illustrate the steps involved in forming a mini-Schur complement. In each 
sub figure, there are two vertical planar graphs, and each graph keeps their own local numbering of the 
nodes. The left vertical planar graph within each sub figure corresponds to matrix G, and the right graph 
corresponds to the matrix D. The whole graph corresponds to the matrix C. The steps involved in building 
a mini-Schur complement, in this case, are illustrated as follows 



1. Choose aggregates a priori for graph of G. One possible aggregate is 

r = {{1, 2}, {3, 4, 5}, {6, 7, 8, 9}, {10, 11}, {12, 13}, {14, 15, 16}}. 
The 4rth aggregate {10, 11} are denoted by solid spheres in the left sub figure of Figure Q. 

2. For MSCN method, we identify the nodes in graph of which have direct edges to the nodes of 
the aggregate {10, 11}. The identified nodes are {10, 11}, i.e., the nodes shaded with horizontal line 
patterns. 
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3. Finally assemble the following matrix 



C4 = 



£'(10,10) 
£'(11,10) 


£(10,11) 
£(11,11) 


£(10,10) 
£(11,10) 


£(10,11) " 
£(11,11) 


£(10,10) 
£(11,10) 


£(10,11) 
£(11,11) 


G(10,10) 
G(ll,10) 


G(10,ll) 
G(ll,ll) 



The matrix above is presented with entries with local numbering. The assembled matrix in terms of 
the global numbering is the following 



C4- 



G(10,10) 
G(ll,10) 


G(10,ll) 
G(ll,ll) 


G(10,26) 
G(ll,26) 


G(10,27) " 
G(ll,27) 


G(26,10) 
G(27,10) 


G(26,ll) 
G(27,ll) 


G(26,26) 
G(27,26) 


G(26,27) 
G(27, 27) 



4. Denote the above block 2x2 matrix as follows 



^4 


^4 " 


£4 


G4 



Thus, the mini Schur complement for the Arth aggregate is given as follows 

5*4 = G4 - £4(£'4)"^£4 (16) 

Finally, the columns corresponding to the Arth aggregate, i.e., for the aggregate {10, 11} in the mini 
Schur complement S4 are S'4(:, 1) and S'4(:,2) in the familiar Matlab notation. We substitute these 
columns in the global Schur complement S. In the global numbering, the columns 6*4(1, :) and 64(2, :) 
maps to the column numbers 6(10,:) and S'(ll, :) of the global Schur complement to be estimated. 
Thus we do the following update to the matrix S 

^(10,10: 11)^54(1,:)^, 
S'(ll,10 : 11) = 64(2,:)^. 

Note that we have used the convenient Matlab notation above. 
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3.3 Matrix view of the MSCN method 



In this section, we shall present a matrix view of the MSCN method. Although, the graph view above and 
the matrix view are both same method. The matrix view will lead to a clearer picture and facilitates proving 
results. 

For MSCN method, no reordering or partitioning is required for the method to work. However, using efficient 
partitioning techniques usually leads us to a considerable saving in the flop count and enhance parallelism. 
Thus, in what follows, we shall assume that a suitable partitioning and reordering has been applied using 
suitable graph partitioner. From now onwards, we use the same notations for the sub-matrices D and E of 
C in the reordered matrix P'^CP. 

As above, we start with the block 2x2 partitioned system as follows 



C = 



" D 


E ' 


F 


G 



Then we partition the matrix further as follows 



C = 





D12 


Ell 


E12 


D21 


D22 


E21 


E22 


Fn 


F12 


Gil 


G12 


F21 


F22 


G21 


G22 



Now we construct a sparse approximation of matrix C above by dropping the blocks Djj , E-ij , Fij , and Gij 
for which i ^ j. As a result, we obtain a first level sparse approximation denoted by C2 as follows 







En 






D22 




E22 


Fn 




Gn 






F22 




G22 



Here the subscript 2 in C2 denotes the number of principle sub matrices of the matrix G. In this case, the 
two principle sub matrices we have retained are Gn and G22- 

For the second level of partition, we further partition the blocks Du, En, Fn, and Gu to get a sparse matrix 
of C as follows 



Dn 


D12 




En 


E12 




D21 


D22 


E21 


E22 




D33 


D34 




E33 


E34 


Di3 


D44 


E43 


E44 


Fn 


F12 




Gn 


G12 




F21 


F22 


G21 


G22 




F33 


F34 




G33 


G34 


F43 


F44 


G43 


G44 



Now as before, we construct a sparse approximation of the above matrix by dropping the blocks Dij,Eij, Fij, 
and Gij for which i ^ j. We obtain a second level sparse approximation denoted by C4 as follows 
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C4 = 



Dii 






Ell 












E22 




D33 






E33 






D44 




E44 


Fn 






Gil 








F22 




G22 




F33 






G33 






F44 




G44 



Eliminating the blocks Fa using as pivots, we obtain an approximation to the original Schur complement 
S' by 5*4 = blkDiag{Sii) as follows 

S^^ = Gu - FuD^Eu, z = 1 : 4 (17) 
where the subscript 4 in S4 is the number of principle sub matrices of matrix G and we call Su a MSG. 



Remark 2 For simplicity, our approach was to partition the matrix recursively into block 2x2 matrix. We 
could directly identify the blocks Gu such that the following expression 

Su = Gu ' FuDr^Eu.i = 1 : m (18) 

makes sense. Here m denote the number of MSCs desired. 



For notational convenience, we denote diagonal blocks of D by Dm — blkDiag{Du), similarly, we denote 
Em ~ blkDiag{Eii), Fm — blkDiag^Fu), and Gm ~ blkDiaglGu). Thus, the matrix Cm in the general case 
is given as follows 







J? 





Thus, we have 5*^ ~ Gm — Fm{Dm) ^Em which is an approximation to the original Schur complement 
S = G — FD^^E. When it is not necessary, we shall omit the subscript m from Sm- 



We are now in a position to formally define the MSGN preconditioner. 



Definition 1 Given a block 2x2 partitioned matrix C as follows 



D 


E ' 


F 


G 



c = 

The MSCN preconditioner denoted by Bmscn is defined as follows 



B 



MSCN — 



D 
F S 



D E 
S 



(19) 



Ifm is the number of MSCs considered, then S — blkDiag{Sii, . . . , Smm), where Su is given by the formula 
(18) above. 



Definition 2 The LUM method is defined as above except that S ~ blkDiag{Gii, . . . , Gmm)- Here m is the 
number of MSCs. 



Tiieorem 1 If D defined above is symmetric positive definite, then exists. Thus, the MSCN precondi- 
tioner exists. 
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Proof: If the (1,1) block D of the matrix C is SPD then each Da (being the diagonal blocks of D) for all 
1 < i < m are SPD and the formula ( 18 1 does not break down. 



Lemma 1 0/ Given a block 2x2 partitioned matrix as follows 



Z = 



D E 
F G 



where the sub matrix D is nonsingular, then the inverse of the matrix Z is given as follows 



z-i = 



D-^ + D-^ES-^FD-^ -D-^ES-^ 
-S-^FD-^ 



where S = G~FD-^E. 



Proof: The proof seems to appear first in The result is known as Banachiewicz inversion formula for 
the inverse of a block 2x2 partitioned matrix. 



Theorem 2 Gonsider the block 2x2 partitioned matrix as follows 

C 



Dpxp E 



F G 

and Bmscn be the MSCN preconditioner as defined in Def. Then the MSCN preconditioned matrix C 

i.e., (B]v[scn)^^C hasp eigenvalues exactly equal one. The rest of the n-p eigenvalues are the eigenvalues 
of iSm)~^S, where S = G — FD~^E is the complete Schur complement of the matrix C. 



Proof: Using Lemma ([I]) above, we have 



-^MSCN 



Syy. 



D 



pxp 



-^pxp 

S-'FD-^ 







(20) 



and the MSCN preconditioned matrix is given as follows 



/pxp D-^EiJ-S-^S) 
S^fi s 



where S = G — FD ^E. Hence the theorem. 



(21) 



Remark 3 In theorem ^ above, we have proved the result for left preconditioned matrix. However, a 
similar result holds for Right preconditioned matrix as follows 



{I 



^pxp 

SSz})FD'^ 







(22) 



where S = G-FD-^E. 
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Table 1: Top Left: Real part of the eigenvalues of original matrix, Top Right Real part of the eigenvalues of 
the MSCN right preconditioned matrix, Bottom left: Eigenvalues (Real and imaginary) of original matrix, 
Bottom right: Eigenvalues (Real and imaginary) of MSCN right preconditioned matrix 




Remark 4 Comparing the express ion for the left preconditioned matrix in Equation ( 21 ) with that of right 
preconditioned matrix in Equation (22), we find that SS^ and S^^S are similar matrices. Thus, we have 
that the n-p eigenvalues for left preconditioned matrix given by eigenvalues of SS^^^ are same as the n-p 
eigenvalues of the right preconditioned matrix given by eigenvalues of S . In practice, the original matrix 
C is indefinite as confirmed in top left figure in Table UV where we notice many negative eigenvalues. 



Remark 5 In general, it is difficult to estimate the eigenvalues of S^^S since we need some additional 
assumptions for the sub matrices of E and E which, in practice, are unknown. However, it may be a good 
idea to partition the matrix which leads to a large (1,1) block D, and as a consequence more and more 
eigenvalues are equal to one. On the other hand, since we will need to solve equations of the form Dx = y, 
so ideally D should be partitioned into .small diagonal blocks that are easier to invert. Thanks to several 
graph partitioning softwares readily available, we could easily obtain such reordering. Some of the popular 
partitioning and reordering softwares available in the public domain are namely, METIS JJ7| / and independent 
set ordering used in \33\. \32^ . Use of such partitioning and reordering techniques leads to a purely algebraic 
domain decomposition which takes matrix as an input and partitions the graph ( rather than the computational 
domain) by selecting a separator (set of edges or vertices), removal of which leads to several disconnected 
subgraphs. 



The quality of the preconditioner is determined by the quahty of the Schur complement approximation and 
the spectrum distribution of S^^S. 



Consider the block 2x2 partitioned matrix as follows 

C = 



D E 
E G 



where D = blkDiag{Di, D2), 



E = 



Ell E12 
E21 E22 



F = 



Fii F12 
F21 F22 



G = 



Gil G12 
G21 G22 



The Schur complement S is given as follows 



S 



Gil — EiiD^ ^Fii — E12D2 ^F2i G12 — EiiD^ ^Fi2 — E12D2 ^-^22 
G21 — E2iD^ Ell ^ E22D2 F21 G22 ^ E2iD^ E12 — E22D2 E22 



(23) 



The approximation to the Schur complement via MSCN method for this model problem is given as follows 



5* = 



Gil ~ EiiD^ ^Fii 



G22 — E2iD^ ^Fi2 



(24) 



Thus, S is an approximation to the block Jacobi preconditioner for S. Thus, for the general case with many 
subdomains, S remains a crude approximation to the block Jacobi preconditioner of S. 



3.4 Previous work 



The methods presented in this work are related to patch method as introduced in [551 [55]. However, there 
are significant differences. We list these differences as follows 
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1. First, we start with an aggregate, i.e., a group of nodes, whereas, in |2Hl ES] the Schur complement 
is built around a node. For a sufhciently large aggregation, the columns corresponding to the whole 
group of nodes of an aggregate are approximated, resulting in a much faster sweep in the graph. 

2. The aggregation scheme for the MSCN method considered in this paper is based on the numbering of 
the nodes in the grid. On the other hand, in the patch method, a small patch is formed around a node 
based on the edge connections. For example, in Figure ([T]), the nearest neighbors of node numbered 
12 are nodes numbered 11 and 13. But for the MSCN method, the nearest neighbors of 12 would be 
nodes numbered 13 and 14. 



4 Exploiting parallelism and implementation aspects 

The methods we have seen in the previous section are massively parallel both during the setup phase as well 
as during the iteration phase. In this section, we understand the parallelism in the method. In Algorithm 
([T]), we provide the pseudocode for the construction phase of the MSG methods and in Algorithm ^ the 
solution procedure is presented. 

We start with the aggregation process by selecting a set of aggregated nodes, and subsequently we construct 
mini Schur complement for each aggregate. The construction of each of the mini Schur complements are 
independent of each other. Thus, if N is the time required to compute the approximated global Schur 
complement sequentially, then since each of the mini Schur complements could be computed by a processor, 
ideally, we have a decrease in the computation time for the setup phase of the preconditioner by N/p, p 
being the number of processors. 

Obviously, the size of the MSCs should be large enough and should involve significant computational work 
compared to the overhead involved in setting up the parallel task. 

For MSCN and MSCE methods, the global Schur complements are block diagonal matrices, see left figure 
in Table Thus, the factorization phase of the approximated global Schur complement is also completely 
parallel. Moreover, due to the same reasons, the solve phase with approximate Schur complement can de 
done in parallel. In Figure ([2]), we show the scalability in the construction phase of the MSCN preconditioner 
for a leaky lid driven cavity problem on 32 x 32 grid. We observe that on four cores, the construction phase 
has a speedup of about two. The scalability result is obtained with parallel computing toolbox of Matlab. 
The parallel program for construction is easily implemented by replacing the keyword "for" by the keyword 
"parfor" as follows 

parfor i = 1 to number of MSCs 
end 



Here the sub matrix Da is reasonably small to be inverted easily by decomposing it into exact triangular 
factors: [La^Uu] = lu{Dii). While computing D~-^Eii, we achieve one more level of parallelism by having 
factorized the matrix Da into a product of lower and upper triangular factors, and then solving with column 
of Eii as the right hand side. On the other hand, in the overlapping MSG case, although the computation of 
MSG can be done independently, the resulting approximation to the global Schur complement is not block 
diagonal as seen in the right figure in Table ^. 
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Algorithm 1 PSEUDOCODE TO CONSTRUCT B FOR MSCN,MSCNR,LUM,OMSCN,OMSCNR, and 

PLUM methods 

INPUT: 



" D 


E ' 


F 


G 


= 





• k = Number of MSCs desired 

• w = array of length k contains size overlap, Wk 

1 1 Find principle submatrices (aggregates) 

if MSCN, LUM, or MSCNR then 

Find principle submatrices of G, D, F ^ and E as follows 
Pg = {Gii,G22, • ■ ■ ,Gfefc}, Gii = G{r, : r^+i - l;ri : r^+i - 1) 
Pd = {Dii,D22, ■ ■ ■,Dkk}, Da = D(ri : n+i - l;ri : r^+i - 1) 
Pf = {Fn, F22, . . . , Fkk}, Fu = Fin : n+i - 1; n : n+i - 1) 
Pe = {Eii,E22, ■ ■ -^Ekk}, Eii = Fin : n+i -l;ri : n+i - 1) 
Here, ri — 1, Vk+i ~ ncols{G) + 1 

else if OMSCN, OLUM, OMSCNR then 

Find principle submatrices of G, D, F, and E as follows 
Pg = {Gil, G22, • • • , Gfefe}, Gii = G{ri : n+i - 1 + Wi;ri : n+i -1 + Wi) 
Pd = {Dii,D22, ■ ■ -.Dkk}, Da = D{ri : n+i - 1 + WjiTj : n+i - 1 + Wi) 
Pf = {Fii,F22, Fkk}, Fu = F{ri : n+i - 1 + rj : rj+i - 1 + u;,) 
Pe = {En,E22, ■■■■ £'fcfc}. En = Fin : r^+i - 1 + w^; r, : r^+i - 1 + u;,) 
Here, ri = 1, r^+i = ncols{G) + l,Wi< (r^+i - r,) 

end if 

// Now construct MSCs 
if MSCN or OMSCN then 
for i to do 

Six Gu PiiP^ii 

end for 
else if LUM then 

for i to k do 

Sii — Gii 

end for 

else if OMSCNR or MSCNR then 
for i to do 

^ii — Gjj 

where 1 
end for 
end if 

Set S = blkDiag{Sii,S22, Skk) 
OUTPUT: 

B = 



-FiiD-^\Eul) 
[1,1,.. 



D 
F 



D- 



D 



E 
S 



(25) 



Here B is either of the MSC based preconditioners 
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Algorithm 2 PSEUDOCODE TO SOLVE WITH B FOR MSCN,MSCNR,LUM,OMSCN,OMSCNR, and 

OLUM methods 

OBJECTIVE 



To solve 



B 



D 

F S 



D- 





' D E ' 




Xl 




yi 




S _ 




. ^2 




. 2/2 . 



(26) 



• D = blkDiag{Aii, A22, Amm), S = blkDiag{Sn, S22, • • • , Skk) 

• m := number of partitions, k := number of MSCs, y := Right hand side 

/ / forward sweep 
for i= 1 to m do 

zii = A'^'^iuii). Here zi = [zn, zim], yi = [yii, ■ ■ ■ , yim] II Can use an inexact solve 
end for 
for i=l to fc do 

Z2% = S^^^{y2i - {Eziji). Here Z2 = [221, Z22, ■ ■ ■ , -Z2fc], 2/2 = [2/21, 2/22, 2/2^]- // Can use an inexact 
solve 
end for 

// backward sweep 

Set X2 = Z2 

for i=:l to m do 

xim = zim — A'^^{Fx2)i, Xl = [xn, a;i2, . . . , xim]- / / Can use an inexact solve 
end for 
OUTPUT: X 
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Figure 2: Scalability curve for the construction of MSCN for a 32 x 32 grid leaky lid driven cavity problem 
on four cores using parallel computing toolbox of Matlab 7.10 




Number of cores 
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Figure 3: Q2-Q1 discretization for leaky lid driven cavity for h=l/128 stretched grid (stretch factor=1.056) 
with Reynolds number of 1000 




5 Numerical experiments 

The numerical experiments were performed on Matlab 7.10 in double precision arithmetic with multi- 
threading enabled on a core 17 (720QM) Intel processor with 8 processing threads. The system had 6GB 
of DDR3 RAM and 6MB of L3 cache. The iterative accelerator used is restarted GMRES with subspace 
dimension 300. We keep the subspace dimension large to keep avoid the effects of restart. The maximum 
number of iterations allowed is 3000 and the stopping criteria is the decrease of relative residual below 10~^. 
The given coefficient matrix is scaled by dividing each row by an entry on the same row with maximum 
absolute value. For the sake of comparison with sequential PCD and LSC methods, the experiments with 
the new methods are also done sequentially. The test set consists of standard leaky lid driven cavity problem 
defined on both uniform and stretched grid. The discretization scheme used is the Q2-Q1 (biquadratic veloc- 
ity/bilinear pressure) mixed finite element discretization with viscosity varying from 0.1 to 0.001. A sample 
of the discretization of Q2-Q1 scheme for h = 1/64 for stretched grid with stretch factor of 1.056 with more 
finer discretization near the boundaries is shown in figure These problems are generated using the IFISS 
software |11] . We compare all three important aspects of the methods, namely, the storage requirements, 
i.e., the fill factor, the iteration count, and the CPU time. For MSCE method, the parameters pi and are 
taken to be equal to "sz" defined in Table ([S]). 

5.1 Leaky lid driven cavity 

In Table Q, we present the results for the uniform grid. We compare all three important aspects of the 
methods proposed namely iteration count, CPU time, and fill factor which is the ratio of the non non-zeros 
in the preconditioner and non-zeros in the original coefficient matrix. We compare new methods with the 
Pressure convection diffusion and Least square commutator methods as implemented in IFISS MATLAB 
toolbox. In the tables, nA denote the number of diagonal block in (1,1) block and nS denote the number 
of independent Schur complement computations. We use an incomplete LU with tolerance 10~^ as an 
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Tabic 3: Notations used in tables of numerical experiments 



Notations 


Mcciiiiiig 




IXFiiTTiriPT" AT n T^f'T'p'hi 7?i"t"ion ■nnmt'Q 


SZ 


S^i7P (~\T tnp A/Tmi StpViiiT" pmnnl pmpni" 


nA 


Nnrnlipr of inrlpDPTirlpTit Vilnrlcs in tVio 1 1 1 i lilnrV 


nS 


lN[nmV)pr of IVTSOs ronsidprpd 


sA 


T'VirPsViolrl mitip of tViP HiaPTinal Hloflcs of il 1 i lilorlc 


tolA 


T'nlpr^inr'p for tfip inronrnlptp TjTT j^Tiriro'v for (1 1 i filorV 


its 


J. IjC-L Cl Ulvjl-i. V^WUlllj 


time 


t"iTnp TOT" POTitifnipfiOTi 3Tirl Qoni1"iOTi PVplnriPQ tifit'1"i1"ioti i"iTnp 


MSCN 


^/l ITll ^PnilT" Pf~lTYl T\ PTTl PTli" ISTl n Til 1 TTl hpT"l Tl hfl QpH f1 CfO'T'PCrfl i"l f~»Tl 


OMSCN 


Overlapped mini Schur complement with numbering based aggregation 


LUM 


Lumped approximation 


MSCE 


Mini Schur complement with edge based aggregation 


MPCD 


Modified pressure convection diffusion 


LSC 


Least square commutator 


NA 


Not applicable 


NC 


Not converged 




Test abandonned due to relatively large time 



inexact solver for the (1,1) blocks. As expected the number of iterations for the PCD and LSC methods 
seem to be independent of the mesh size h. But for the new MSC based methods there is slight increase 
in the iteration count. This is expected for a domain decomposition based method; more the number of 
subdomains more the iteration count, but more the parallelism. Nevertheless, the CPU times of the new 
methods are better compared to both PCD and LSC with many independent blocks. For instance, for the 
h — 1/32 we have 5 subdomains for the (1,1) block and 11 subdomains for the Schur complement, while for 
h — 1/128, we have 12 subdomains for the (1,1) block and 31 independent Schur complement computation. 
On comparing the dependence of iteration count with increasing Reynolds number, we observe that iteration 
count increses mildy until Reynolds number is equal to 1000. We notice that the system remains in steady 
state untill Reynolds number of 1000 as seen in figure Q. For Reynold number larger than 1000, we expect 
the unsteady state with several vortices in the strealine contours as shown in figure ([5]); the velocity contours 
suggests that the motion is nearly chaotic. Comparing with PCD and LSC for Reynolds number higher 
that 500 the iteration count for PCD and LCS are higher that the MSC based preconditioners. The size of 
each of the MSC are kept equal and they are equal to sz in the table. The overlap for the OMSCN and 
OMSCNR methdods are equal to sz. Clearly, the overlap shows some improvements since, the approximation 
is better compared to the non-overlaping case. We have an interesting observation that for LUM shows same 
iteration count as MSCN, the reason is that the En and Fa blocks are very sparse and often zero and thus 
Gii~ FiiDYi^ Eii Gii. However, this is only true for the uniform grid, for stretched grid in Table ([s]) MSCN 
converges faster compared to LUM method. But all in all for both the grid OMSCN and OMSCNR seems 
to have less iteration count and less CPU time compared to the PCD and LSC methods. For MSCE method 
the convergence time was found to be very large compared to all the methods and it will be investigated in 
future work. 



6 Conclusions 



In this work, we proposed a new class of algebraic approximation to the Schur complement based precon- 
ditioners for Navier-Stokes problem, we observe the superiority of the methods compared to PCD and LSC 
method especially for Reynolds number larger than 100. The proposed methods arc highly parallel, and very 
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Figure 4: Left: Velocity streamlines, Right: Pressure field for leaky lid driven cavity for h=l/128 stretched 
grid with Q2-Q1 discretization with Reynolds number of 1000 




Figure 5: Left: Velocity streamlines, Right: Pressure field for leaky lid driven cavity for h=l/128 stretched 
grid with Q2-Q1 discretization with Reynolds number of 3000 
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suitable for modern day multiprocessor and/or multi-core systems. 



As a future work, an implementation of the methods in parallel may be done and an extensive comparison 
with other existing preconditioners for the Navier-Stokes systems will be added. A coarse grid corrections 
may be introduced which may improve the convergence of the methods significantly, however, it may very 
well be the bottleneck in achieving overall parallelism in the two level methods. 
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